feat: Add Hook Level Lineage to SQL hooks #61535

Merged
mobuchowski merged 1 commit into apache:main from kacpermuda:feat-ol-hll-sql
Feb 17, 2026

kacpermuda (Contributor) commented Feb 6, 2026

Add hook-level lineage (HLL) reporting to SQL hooks via send_sql_hook_lineage
This PR introduces a standardized mechanism for SQL hooks to report execution metadata - SQL text, query parameters, job IDs, row counts, default database/schema - to the hook lineage collector using add_extra.
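
As a rough illustration of the reporting pattern described above, here is a minimal sketch. It makes several assumptions: `ToyCollector`, `report_sql_execution`, the `"sql_job"` key, and all parameter names are hypothetical stand-ins and do not reflect the actual signature of `send_sql_hook_lineage` or of the real collector. Only the idea of attaching execution metadata through an `add_extra` call comes from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyCollector:
    """Hypothetical stand-in for a hook lineage collector."""

    extras: list[dict[str, Any]] = field(default_factory=list)

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        # Record which hook reported the extra, under which key, and the payload.
        self.extras.append({"context": context, "key": key, "value": value})


def report_sql_execution(
    collector: ToyCollector,
    hook: Any,
    sql: str,
    *,
    sql_parameters: Any = None,
    job_id: str | None = None,
    row_count: int | None = None,
    default_db: str | None = None,
    default_schema: str | None = None,
) -> None:
    """Hypothetical helper: bundle execution metadata into a single extra."""
    optional = {
        "sql_parameters": sql_parameters,
        "job_id": job_id,
        "row_count": row_count,
        "default_db": default_db,
        "default_schema": default_schema,
    }
    payload: dict[str, Any] = {"sql": sql}
    # Only include the fields a given database actually exposes.
    payload.update({k: v for k, v in optional.items() if v is not None})
    collector.add_extra(context=hook, key="sql_job", value=payload)
```

Dropping unset fields keeps the payload uniform across databases that expose different metadata, which is one possible way to handle the differences between hooks.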

I also bumped the required sql-common version for all modified providers, so that HLL is actually emitted.

I've also added tests for most hooks that use DbApiHook as a base class, to make sure that Hook Level Lineage is still sent even if some methods are overridden in the future. For now this mostly tests the DbApiHook implementation multiple times, but if some database hook overrides run(), I need its test to fail so that the new implementation also calls the HLL collector.
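
The guard-test idea can be sketched roughly as follows. Everything here is hypothetical: `BaseToyHook`, `OverridingToyHook`, `_report_lineage`, and `lineage_reported` are illustrative names, not the tests or hooks from this PR. The point is only that a per-hook check catches an override of `run()` that forgets to report lineage.

```python
from unittest import mock


class BaseToyHook:
    """Hypothetical base hook whose run() reports lineage after executing."""

    def _execute(self, sql: str) -> int:
        return 1  # pretend the cursor reported one affected row

    def _report_lineage(self, sql: str, row_count: int) -> None:
        pass  # a real hook would forward this to the lineage collector

    def run(self, sql: str) -> int:
        row_count = self._execute(sql)
        self._report_lineage(sql, row_count)
        return row_count


class OverridingToyHook(BaseToyHook):
    """Hypothetical subclass that overrides run() and forgets lineage."""

    def run(self, sql: str) -> int:
        return self._execute(sql)


def lineage_reported(hook_cls: type) -> bool:
    """Return True if hook_cls.run() calls the lineage reporter."""
    hook = hook_cls()
    with mock.patch.object(hook, "_report_lineage") as reporter:
        hook.run("SELECT 1")
    return reporter.called
```

Running the same check against every hook class looks redundant while they all inherit `run()`, but it is what makes the check fail for a class like `OverridingToyHook`.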

Important context

The HLL collector is a no-op unless a collector is registered (e.g. by the OpenLineage provider). This means no runtime overhead for users who don't use lineage collection.
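
A minimal sketch of the no-op pattern, assuming a simple module-level registry: `NoOpCollector`, `RecordingCollector`, `register_collector`, and `get_collector` are hypothetical names for illustration and are not the actual Airflow implementation.

```python
from typing import Any, List, Optional, Tuple


class NoOpCollector:
    """Hypothetical default collector: accepts calls and discards them."""

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        return None


class RecordingCollector(NoOpCollector):
    """Hypothetical collector that a lineage provider might register."""

    def __init__(self) -> None:
        self.extras: List[Tuple[str, Any]] = []

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        self.extras.append((key, value))


_registered: Optional[NoOpCollector] = None


def register_collector(collector: NoOpCollector) -> None:
    """Install a collector; until this is called, reporting does nothing."""
    global _registered
    _registered = collector


def get_collector() -> NoOpCollector:
    # Hooks always call the same interface; only the registered
    # implementation decides whether anything is stored.
    return _registered if _registered is not None else NoOpCollector()
```

With this shape, hooks do not need to check whether lineage is enabled before reporting.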

Motivation

Black-box operators (e.g. PythonOperator calling PostgresHook.run(sql)) currently produce no lineage. With this change, any registered collector can capture the SQL being executed, parse it for input/output datasets, and attach query IDs to lineage events - dramatically improving lineage quality without requiring operator-level changes.

Follow-up PRs

  • OpenLineage consumer: modify the OL provider to consume these extras, parse SQL for datasets, and attach query_id to OL events
  • BigQueryHook insert_job: mixes SQL and non-SQL lineage; will be handled in a separate PR.
  • Additional non-SQL hooks: extend HLL to more hooks beyond SQL

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Co-authored by: Cursor following the guidelines



shahar1 (Contributor) commented Feb 6, 2026

CI currently fails :(

kacpermuda (Contributor Author) commented

Thanks, I'll work on the fix; some rowcount mocks are probably missing from the new tests.

kacpermuda (Contributor Author) commented

I think the .pyi file has not yet been generated for the new lineage.py file; I'll look into that as well. For now I've applied fixes for the CI.

kacpermuda (Contributor Author) commented

@potiuk I see that the script updating the SQL stubs was removed when moving to the new repo structure in #45964. README_API.md still documents the prek approach, which does not seem to work or exist. Do we have any checks for the SQL stubs being out of sync? I can't find any, and the last PR that modified the stubs, #61144, seems to have updated them manually.

kacpermuda (Contributor Author) commented

Pinging you since you originally created the script a long time ago 😄 and then moved us to the new structure; maybe uv is handling it somehow in a way that I'm not aware of.

dabla (Contributor) commented Feb 13, 2026

I'm just wondering, and I'm not sure if it's possible, but lineage is a cross-cutting concern, so would a decorator not be a viable solution for this?

kacpermuda (Contributor Author) commented

Yes, lineage as a cross-cutting concern is a common concept, and there are multiple ways to implement it. In the case of Hook Level Lineage, we can send assets directly via add_input_asset and add_output_asset, but each scheme has its own asset factory and different kwargs for creating the asset URI. We also use add_extra for things like tracking SQL executions, but since different databases can expose different arguments, the tracking isn't uniform.

A decorator could work, but I think the current approach is clearer and more explicit. Given that this helper is limited to SQL-based hooks, there’s no strong need for a global decorator in my opinion.

If you have an example of what it could look like, so that it's easier to implement or easier to read, let me know; I'm open to changing the approach. I just feel there is not much to gain from a decorator here: it'd be the same code, just implemented differently. With an explicit call we have full control over when we execute it: after cursor execution, before the connection is closed, and, e.g., just once for insert_rows when the cursor is called multiple times.
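
To make the timing argument concrete, here is a small sketch using the standard-library `sqlite3` module. It is not the PR's code: `insert_rows`, `make_conn`, and the `report` callback are hypothetical illustrations of an explicit call that runs once, after all cursor executions and before the connection is closed.

```python
import sqlite3
from typing import Any, Callable, Sequence


def make_conn() -> sqlite3.Connection:
    """Hypothetical connection factory with a small demo table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    return conn


def insert_rows(
    conn_factory: Callable[[], sqlite3.Connection],
    table: str,
    rows: Sequence[Sequence[Any]],
    report: Callable[..., None],
) -> None:
    """Insert rows one by one, reporting lineage a single time."""
    placeholders = ", ".join("?" for _ in rows[0])
    sql = f"INSERT INTO {table} VALUES ({placeholders})"
    conn = conn_factory()
    try:
        cur = conn.cursor()
        total = 0
        for row in rows:
            cur.execute(sql, row)  # the cursor is called once per row
            total += cur.rowcount
        conn.commit()
        # Explicit call: once, after execution, while the connection is open.
        report(sql=sql, row_count=total)
    finally:
        conn.close()
```

A decorator wrapped around the cursor call would fire once per row here, whereas the explicit call reports the statement a single time with the aggregated row count.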

dabla (Contributor) commented Feb 13, 2026

Thank you for your reply and explanation. Indeed, it's not as easy as it sounds. I certainly wouldn't change the current implementation, as I like the work you did on this PR.

This was just a thought of mine, something we might consider in the future to make it available more out of the box. I wasn't yet thinking of all hooks in general, but purely of the ones based on DbApiHook, which would already be a challenge.

mobuchowski merged commit 1835b69 into apache:main Feb 17, 2026
249 of 250 checks passed
kacpermuda deleted the feat-ol-hll-sql branch February 17, 2026 12:37
choo121600 pushed a commit to choo121600/airflow that referenced this pull request Feb 22, 2026